Classification-Aware Hidden-Web Text Database Selection
Abstract
Many valuable text databases on the web have noncrawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web” text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. Our algorithm is the first to construct ... In this paper, we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and a TREC dataset shows that we achieve an order-of-magnitude improvement compared to the current baseline approaches.
Similar resources
Summarizing and Searching Hidden-Web Databases Hierarchically Using Focused Probes
Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically rel...
Classifying and Searching Hidden-Web Text Databases
Classifying and Searching Hidden-Web Text Databases Panagiotis G. Ipeirotis The World-Wide Web continues to grow rapidly, which makes exploiting all available information a challenge. Search engines such as Google index an unprecedented amount of information, but still do not provide access to valuable content in text databases “hidden” behind search interfaces. For example, current search engi...
Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically rel...
Sources Selection Methodology for Hidden Web Data Integration
In internet-scale hidden-web data integration, the selection of sources (web databases) has been a primary challenge. This paper proposes a novel approach to web database selection for internet-scale hidden-web data integration. The approach is based on a benefit function that evaluates how much benefit a web database brings to a given status of the integration system by integrating ...
Automatic Hidden Web Database Classification
This paper addresses a method for the automatic classification of Hidden-Web databases. In our approach, the classification tree for Hidden Web databases is constructed by tailoring the well-accepted classification tree of the DMOZ Directory. Then the feature for each class is extracted from randomly selected Web documents in the corresponding category. For each Web database, query terms are sel...
Publication date: 2014